Faster, Cheaper, Better – a Hybridization Methodology to Develop Linear Algebra Software for GPUs
نویسندگان
چکیده
3 4 for (n = 0 ; n < nt ; n++) // l oop on c o l s 5 for (m = 0 ; m < mt ; m++) // l oop on rows 6 s t a r pu ma t r i x da t a r e g i s t e r (& t i l e h a nd l e [m] [ n ] , 0 , 7 &t i l e [m] [ n ] , M, M, N, s izeof ( f loat ) ) ; Figure 3: Registration of the tiles as handles of matrix data type. Initialization. When initializing StarPU with starpu_init, StarPU automatically detects the topology of the machine and launches one thread per processing unit to execute the tasks. Data registration. Each tile is registered into StarPU to be associated to a handle. As shown in Figure 3, the tile_handle[m][n] StarPU abstraction is obtained from each actual memory pointer, tile[m][n]. Several data types are pre-defined for the handles. Here, tiles are registered as matrices since a submatrix is itself a matrix. Codelets definition. As shown at lines 38-42 for the sgemm_codelet in Figure 4, a codelet is a structure that describes a multi-versioned kernel (sgemm here). It contains pointers to the functions that implement the kernel on the different types of units: lines 1-14 for the CPU and 1630 for the GPU. The prototype of these functions is fixed: an array of pointers to the data interfaces that describe the local data replicates, followed by a pointer to some user-provided argument for the codelet. The STARPU_MATRIX_GET_PTR is a helper function that takes a data interface in the matrix format and returns the address of the local copy. Function starpu_unpack_cl_args is also a helper function that retrieves the arguments stacked in the cl_arg pointer by the application. Those arguments are passed when the tasks are inserted. Tasks insertion. In StarPU, a task consists of a codelet working on a list of handles. The access mode (e.g., read-write) of each handle is also required so that the runtime can compute the dependences between tasks. A task may also take values as arguments (passed through pointers). A task is inserted with the starpu_insert_Task function. 1 Lines 33-41 in Figure 5 shows how the sgemm task is inserted. The first argument is the codelet, sgemm_codelet. The following arguments are either values (keyword VALUE) or handles (when an access mode is specified). For instance, a value is specified at line 34, corresponding to the content of the notrans Other interfaces not discussed here are available. 7 in ria -0 05 47 84 7, v er si on 1 17 D ec 2 01 0 1 void sgemm cpu func (void ∗desc r [ ] , void ∗ c l a r g ) { 2 int transA , transB , M, N, K, LDA, LDB, LDC; 3 f loat alpha , beta , ∗A, ∗B, ∗C; 4 5 A = STARPU MATRIX GET PTR( desc r [ 0 ] ) ; 6 B = STARPU MATRIX GET PTR( desc r [ 1 ] ) ; 7 C = STARPU MATRIX GET PTR( desc r [ 2 ] ) ; 8 9 s t a rpu unpack c l a r g s ( c l a r g , &transA , &transB , &M, 10 &N, &K, &alpha , &LDA, &LDB, &beta , &LDC) ; 11 12 sgemm( CblasColMajor , transA , transB , M, N, K, 13 alpha , A, LDA, B, LDB, beta , C, LDC) ; 14 } 15 16 void sgemm cuda func (void ∗desc r [ ] , void ∗ c l a r g ) { 17 int transA , transB , M, N, K, LDA, LDB, LDC; 18 f loat alpha , beta , ∗A, ∗B, ∗C; 19 20 A = STARPU MATRIX GET PTR( desc r [ 0 ] ) ; 21 B = STARPU MATRIX GET PTR( desc r [ 1 ] ) ; 22 C = STARPU MATRIX GET PTR( desc r [ 2 ] ) ; 23 24 s t a rpu unpack c l a r g s ( c l a r g , &transA , &transB , &M, 25 &N, &K, &alpha , &LDA, &LDB, &beta , &LDC) ; 26 27 cublasSgemm( magma const [ transA ] [ 0 ] , magma const [ transB ] [ 0 ] , 28 M, N, K, alpha , A, LDA, B, LDB, beta , C, LDC) ; 29 cudaThreadSynchronize ( ) ; 30 } 31 32 struct s t a rpu pe r fmode l t cl sgemm model = { 33 . type = STARPU HISTORY BASED, 34 . symbol = ”sgemm” 35 } ; 36 37 s t a rpu code l e t sgemm codelet = { 38 . where = STARPU CPU|STARPU CUDA, // who may e x e cu t e ? 39 . cpu func = sgemm cpu func , // CPU imp lementa t ion 40 . cuda func = sgemm cuda func , // CUDA imp lementa t ion 41 . nbu f f e r s = 3 , // number o f hand l e s a c c e s s e d by t h e t a s k 42 . model = &cl sgemm model // per formance model ( o p t i o n a l ) 43 } ; Figure 4: A codelet implementing sgemm kernel. 8 in ria -0 05 47 84 7, v er si on 1 17 D ec 2 01 0 1 void hybr id cho l e sky ( s ta rpu data hand le ∗∗Ahandles , 2 int M, int N, int Mt, int Nt , int Mb) 3 { 4 int lower = Lower ; int upper = Upper ; int r i gh t = Right ; 5 int notrans = NoTrans ; int con j t r ans = ConjTrans ; 6 int nonunit = NonUnit ; f loat one = 1 .0 f ; f loat mone = −1.0 f ; 7 8 int k , m, n , temp ; 9 for ( k = 0 ; k < Nt ; k++) 10 { 11 temp = k == Mt−1 ? M−k∗Mb : Mb ; 12 s t a rpu In s e r t Task ( s p o t r f c o d e l e t , 13 VALUE, &lower , s izeof ( int ) , VALUE, &temp , s izeof ( int ) , 14 INOUT, Ahandles [ k ] [ k ] , VALUE, &Mb, s izeof ( int ) , 0) ; 15 16 for (m = k+1; m < Nt ; m++) 17 { 18 temp = m == Mt−1 ? M−m∗Mb : Mb ; 19 s t a rpu In s e r t Task ( s t r sm code l e t , 20 VALUE, &r ight , s izeof ( int ) , VALUE, &lower , s izeof ( int ) , 21 VALUE, &conjt rans , s izeof ( int ) , VALUE, &nonunit , s izeof ( int ) , 22 VALUE, &temp , s izeof ( int ) , VALUE, &Mb, s izeof ( int ) , 23 VALUE, &one , s izeof ( f loat ) , INPUT, Ahandles [ k ] [ k ] , 24 VALUE, &Mb, s izeof ( int ) , INOUT, Ahandles [m] [ k ] , 25 VALUE, &Mb, s izeof ( int ) , 0) ; 26 } 27 28 for (m = k+1; m < Nt ; m++) 29 { 30 temp = m == Mt−1 ? M−m∗Mb : Mb; 31 for (n = k+1; n < m; n++) 32 { 33 s t a rpu In s e r t Task ( sgemm codelet , 34 VALUE, ¬rans , s izeof ( notrans ) , 35 VALUE, &conjt rans , s izeof ( con j t r ans ) , 36 VALUE, &temp , s izeof ( int ) , VALUE, &Mb, s izeof ( int ) , 37 VALUE, &Mb, s izeof ( int ) , VALUE, &mone , s izeof ( f loat ) , 38 INPUT, Ahandles [m] [ k ] , VALUE, &Mb, s izeof ( int ) , 39 INPUT, Ahandles [ n ] [ k ] , VALUE, &Mb, s izeof ( int ) , 40 VALUE, &one , s izeof ( one ) , INOUT, Ahandles [m] [ n ] , 41 VALUE, &Mb, s izeof ( int ) , 0) ; 42 } 43 44 s t a rpu In s e r t Task ( s s y rk code l e t , 45 VALUE, &lower , s izeof ( int ) , VALUE, ¬rans , s izeof ( int ) , 46 VALUE, &temp , s izeof ( int ) , VALUE, &Mb, s izeof ( int ) , 47 VALUE, &mone , s izeof ( f loat ) , INPUT, Ahandles [m] [ k ] , 48 VALUE, &Mb, s izeof ( int ) , VALUE, &one , s izeof ( f loat ) , 49 INOUT, Ahandles [m] [m] , VALUE, &Mb, s izeof ( int ) , 0) ; 50 } 51 } 52 53 s t a r p u t a s k w a i t f o r a l l ( ) ; 54 } Figure 5: Actual implementation of the tile Cholesky hybrid algorithm with StarPU 9 in ria -0 05 47 84 7, v er si on 1 17 D ec 2 01 0 variable. On the right of line 40, the handle of the tile (m,n) is passed in readwrite mode (key-word INOUT). Figure 5 is a complete implementation of the tile Cholesky algorithm from Figure 1, showing the ease of programmability. Finalization. Once all tasks have been submitted, the application can perform a barrier using the starpu_task_wait_for_all() function (line 53 in Figure 5). When it returns, we can stop maintaining data coherency and put the tiles back into main memory by unregistering the different data handles. Calling starpu_shutdown() releases all the resources. Choice or design of a scheduling strategy. Once the above steps have been completed, the application is fully defined and can be executed as it is. However, the choice of a strategy may be critical for performance. StarPU provides several built-in pre-defined strategies that the user can select during the initialization, depending on the specificities and requirements of the application. When the performance of the kernels is stable enough to be predictable directly from the previous executions (as it is the case with Tile Cholesky factorization), one may associate an auto-tuned history-based performance model to a codelet as shown at lines 32-35 and 42 in factorization), one should associate an auto-tuned history-based performance model to a codelet as shown on lines 32-35 and 42 in Figure 4. If all codelets are associated to a performance model, it is then possible to schedule the tasks according to their expected termination time. The most efficient scheduling strategy (among those available in StarPU) for the Cholesky factorization is based on the standard Heterogeneous Earliest Finish Time (HEFT) [6] scheduling heuristic which aims at minimizing the termination time of the tasks on heterogeneous platforms. Given the impact of data transfers, especially when it comes to multiple accelerators, we extended this policy to take data transfer into account and keep it as low as possible. StarPU also provides a framework to develop ad hoc scheduling strategies in a high-level way, but the methodology to write a scheduler in StarPU is out of the scope of this description.
منابع مشابه
Investigating the Effects of Hardware Parameters on Power Consumptions in SPMV Algorithms on Graphics Processing Units (GPUs)
Although Sparse matrix-vector multiplication (SPMVs) algorithms are simple, they include important parts of Linear Algebra algorithms in Mathematics and Physics areas. As these algorithms can be run in parallel, Graphics Processing Units (GPUs) has been considered as one of the best candidates to run these algorithms. In the recent years, power consumption has been considered as one of the metr...
متن کاملLeading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations
Today’s high computational demands from engineering fields and complex hardware development make it necessary to develop and optimize new algorithms toward achieving high performance and good scalability on the next generation of computers. The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software tha...
متن کاملAccelerating GPU Kernels for Dense Linear Algebra
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU specific optimiz...
متن کاملHybridization of Facial Features and Use of Multi Modal Information for 3D Face Recognition
Despite of achieving good performance in controlled environment, the conventional 3D face recognition systems still encounter problems in handling the large variations in lighting conditions, facial expression and head pose The humans use the hybrid approach to recognize faces and therefore in this proposed method the human face recognition ability is incorporated by combining global and local ...
متن کاملA scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
Aiming to fully exploit the computing power of all CPUs and all GPUs on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, as well as to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentr...
متن کاملAutotuning dense linear algebra libraries on GPUs
As GPUs are quickly evolving in complexity, tuning numerical libraries for them is becoming more challenging. We present an autotuning approach in the area of dense linear algebra (DLA) libraries for GPUs. The MAGMA library is used to demonstrate the techniques and their effect on performance and portability across hardware systems. We show that, figuratively speaking, our autotuning approach f...
متن کامل